The Standards for Evaluations of Educational Programs, Projects, and Materials (Joint Committee, 1981) were developed during a period of four years and published by a joint committee of 17 members representing 12 organizations associated with educational evaluation. These standards were developed in response to a recommendation appearing in the 1974 APA Standards for Educational and Psychological Tests (APA, 1974). They represent an extension not only from tests to program evaluations but also an extension from a narrow scope of concern for reliability and validity into a wide perspective on evaluation (Nevo, 1983) and evaluation standards. They focus on four major groups of standards: utility, feasibility, propriety, and accuracy. It seemed reasonable to apply these four groups of standards also to testing methods, and not to limit their use to evaluations of projects and programs. Such an application could provide a wider basis for the development of a comprehensive set of standards for educational as well as psychological testing.

The Joint Committee's 30 standards for evaluation of programs, projects and materials were used to develop 23 standards for testing methods. Parallel to the Joint Committee's standards, they were organized in four groups of standards: Utility, Accuracy, Feasibility and Fairness. Following is a description of these newly developed groups of standards:

STANDARDS FOR EDUCATIONAL TESTING METHODS

A. Utility Standards

The Utility Standards are intended to ensure that a testing method will serve the practical information needs of given audiences. These standards are:

A-1 Audience Identification
    Audiences involved in or affected by the testing should be identified, so that their needs can be addressed.

A-2 Tester Credibility
    The persons conducting the testing should be both trustworthy and competent to perform the testing, so that their findings achieve maximum credibility and acceptance.

A-3 Information Scope
    Information collected by the test(s) should be of such scope as to address pertinent questions about students' achievements and be responsive to the information needs and interests of specified audiences.

A-4 Justified Criteria
    Criteria used to determine [...] and marks are clearly described and justified.

A-5
    Testing results are presented in forms readily understood by identified audiences.

A-6 Report Dissemination
    Testing results are disseminated to all relevant audiences, so that they can assess and use them.

A-7 Report Timeliness
    Release of testing results should be timely, so that audiences can best use them.

A-8 Evaluation Impact
    Testing has a positive impact on the teaching and learning process and on the decision making processes of all parties associated with the testing.

B. Accuracy Standards

The Accuracy Standards are intended to ensure that a testing method will reveal and convey technically adequate information on the educational achievements of those that are being tested. These standards are:

B-1 Valid Measurement
    Testing is conducted by instruments and procedures providing valid information for a given use.

B-2 Reliable Measurement
    Testing is conducted by instruments and procedures providing reliable information for a given use.

B-3 Testing Conditions
    Testing conditions are described in enough detail, so that their adequacy can be considered when assessing the achievements of each student.

B-4 Test Security
    Test materials and testing procedures are safeguarded to avoid fraud and cheating.

B-5 Data Analysis
    Testing data are appropriately analyzed to ensure supportable interpretations of test scores.

B-6 Objective Reporting
    Test results are reported objectively without distortion by personal feelings and biases of testers and scorers.

C. Feasibility Standards

The Feasibility Standards are intended to ensure that a testing method [...]. These standards are:

C-1 Practical Procedures
    Testing is conducted with minimum disruption of educational and administrative processes at school and with consideration of existing constraints.

C-2 Political Viability
    Testing is planned and conducted with anticipation of the different positions of [...] groups, so that their cooperation may be obtained.

C-3
    Testing does produce information of sufficient value to justify the resources expended.



D. Fairness Standards

The Fairness Standards are intended to ensure that a testing method is conducted legally, ethically, and with due regard to the welfare of test-takers as well as those affected by test results. These standards are:

D-1
    Tests are based on subject matter and [...] criteria.

D-2 Rights of Human Subjects
    Testing is designed and conducted so that the rights and welfare of human subjects are respected.

D-3 Public's Right to Know
    The public's right to know the results of testing and its consequences is respected, within the limits of other related principles such as those dealing with public safety and the right to privacy.

D-4 Conflict of Interest
    Conflict of interest, frequently unavoidable, is dealt with openly and honestly, so that it does not compromise the testing process and results.

D-5 Social Values
    Testing is conducted in accord with social values and does not stimulate violation of norms and values accepted at school or society.

D-6
    Test results are complete and fair in their presentation of strengths and weaknesses of the individual [...].

The purpose of this study was to test the validity and applicability of the newly developed standards. They were applied to assess four alternative testing methods of oral proficiency in English as a Foreign Language (EFL). The four testing methods were: an oral interview, a role play, a reporting task, and a group discussion test. These methods had to be assessed to develop a recommendation for the Ministry of Education in Israel regarding the adoption of an appropriate procedure to test oral proficiency in English as a Foreign Language within the matriculation exams administered at the end of high school to all students. It was apparent that such a decision could not be limited to validity and reliability, and a wide scope of decision criteria had to be used. For this purpose the extensive scope of the Joint Committee's Standards seemed to be a plausible approach to this problem.

Before proceeding with the study design and its findings, a short discussion of testing methods of oral proficiency, on which this study focused, will be presented.

ALTERNATIVE TESTING METHODS OF ORAL PROFICIENCY

The increased interest in the teaching [...] skills has brought about greater emphasis on both the [...] and the testing of oral proficiency. Oral performance in communicative situations is one of the most difficult skills to assess. Although in the past decade several attempts have been made to develop tests that would provide better measures of oral proficiency (Madsen & Jones, 1981), the research on these tests is still very limited.

Currently in Israel EFL oral proficiency is tested within the framework of the high school leaving examination ("The Matriculation Exam") administered nationally by the Ministry of Education. The testing procedure is a conversation in which a tester interviews each student individually. Students' performance on that test provides the basis for the oral proficiency [...]. Several deficiencies seem to be found with this procedure: (a) The oral interview test represents a narrow domain of oral performance, and it is therefore questionable whether it is a valid indication of students' overall oral proficiency. (b) Since rater [...] is not assessed, it is questionable whether the score obtained by the student is his "true score", especially since the testers are not trained in either interviewing techniques or rating oral proficiency. (c) The test has very low variance and relatively high scores; literally nobody fails the test. This has caused some officials in the Ministry of Education to call for the abolishment of the oral test, since it provides little information compared to its cost.
[...] constitutes a problem, since among the tests available hardly any have been sufficiently researched to [...] their implementation on a nation-wide scale. The only oral test that has been researched extensively is the FSI (Foreign Service Institute) Oral Interview (Bachman & Palmer, 1981; Shohamy, [...]; Hinofotis, 1976; Clifford, 1977). However, one of the major shortcomings of that test is that it does not encompass a wide enough variety of speech styles (Shohamy, 1983); it is limited to questions asked by the tester and answers supplied by the test-taker. It is obvious that this test is not comprehensive enough to assess all the aspects of oral proficiency.

Thus, an attempt was made to develop a comprehensive battery of oral proficiency tests, representing four different speech styles. Following is a description of these four tests:

(a) The Oral Interview (OI)

The rationale underlying this test was to guide the test-taker into a dialogue with the tester in which [...] answers to questions asked by the tester [...] topics. The test encompassed a variety of topics and represented a low-high role relationship between the participants. The test followed the model of the [...] Interview (Lowe, 1981; Oller, 1981), in which the test-taker is pushed to the highest level of his oral proficiency. The test consisted of four phases: (1) warm-up, where the test-taker was put at ease and the tester derived a preliminary indication of the test-taker's level of proficiency; (2) level-check, where the tester checked the functions and content which the test-taker could perform most accurately; (3) probing, where the tester assessed the highest level at which the test-taker could function accurately; and (4) wind-up, where the test-taker was returned to the level at which he could function most comfortably. The scoring of the oral interview was done on the basis of the same rating scale used for all the other tests.

(b) The Role Play (RPL)

The rationale behind this test was to [...] the test-taker to produce spontaneous speech within the limits of a pseudo-authentic situation. [...] dialogue between two participants who represented [...] role-relationships between the speakers (equal, [...] high-low), and the level of [...] required by the specific simulated situation. The test-taker was given a card on which he found the description of the situation and his expected role in it. The tester then engaged in the simulated situation. The test lasted for about ten minutes, and a score was assigned by an assessor who was not involved in the RPL, on the basis of the same rating scale used for all the other tests.



(c) The Reporting Test (REP)

The rationale underlying this test was to [...] the test-taker into a monologue [...] input in the [...] a unilateral skill of [...]. The role relationship between the speaker and the listener was low to high, and the [...] test were [...] explaining and reporting. The student was given an article in Hebrew, which he was asked to read silently [...] in his [...] translate the text [...] referring back to the text. The test lasted about 10 minutes and was scored on the basis of the same rating scale used for the other tests.

(d) The Group Discussion (GD)

The rationale underlying this test was to stimulate the test-takers [...] spontaneous discussion of a controversial issue, in which they could express views about topical [...], debate and argue over them, [...] opinions of other participants [...] accept them. This test required multilateral [...], and the role relationship among the participants was equal. [...] subject or issue controversial enough to lend itself to a lively discussion [...] picked a card [...] regarding the [...] and plan the procedure of their discussion among [...] before starting [...]. [...] listened to the discussion without interfering [...] scored the performance of each of the four test-takers on [...] all the other tests.
THE STUDY DESIGN

The study reported in this paper is based on three sources: (a) an experimental try-out of the four testing methods, (b) an evaluation of the testing methods by a panel of experts, and (c) an analysis of the same testing methods by policy makers. Following is a short description of each of the three sources.
(a) The experimental try-out

The four alternative testing methods of oral proficiency were tried out with a sample of 103 twelfth grade students of four classes in a comprehensive high school north of Tel Aviv. The classes were randomly selected out of seven classes in that school. All students took all four tests. The tests were administered independently and lasted for about ten minutes each. To minimize the learning effect possibly created by the order of the tests, groups of students took the tests in [...] order, so that total rotation was ensured. The testers who were assigned to administer the tests were experienced EFL teachers who were trained in administering and rating the different tests. The rating of students' performance was done on the spot using a rating scale [...] Clifford and Lowe (1981). It rated oral proficiency on a scale ranging from 4 to 10, 10 being equivalent to [...] performance. The tests were all taped to allow for an additional rating in order to compute rater reliability.

For a detailed description of this study see Shohamy, Reves, & Bejarano (1984).

The scale of 4 to 10 is the conventional scale regularly used in the Israeli school system.
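The "total rotation" of test order described above can be illustrated with a small sketch. It assumes four student groups and a simple cyclic rotation (a Latin square), so that every test appears once in every position; the paper states only that total rotation was ensured, so the number of groups and the specific orders below are hypothetical.

```python
from collections import Counter

# Test abbreviations as used in the paper.
TESTS = ["OI", "RPL", "REP", "GD"]


def rotated_orders(tests):
    """Return one cyclic rotation of the test list per (hypothetical) group."""
    return [tests[i:] + tests[:i] for i in range(len(tests))]


orders = rotated_orders(TESTS)
for group, order in enumerate(orders, start=1):
    print(f"Group {group}: {' -> '.join(order)}")

# Check the balancing property: across groups, each test occupies
# each serial position exactly once.
for position in range(len(TESTS)):
    assert Counter(order[position] for order in orders) == Counter(TESTS)
```

With such a scheme any practice effect tied to serial position is spread evenly over the four tests rather than favoring whichever test is taken last.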



On completing each of the four tests the students filled out a questionnaire which assessed their attitudes toward the four tests.

Two weeks after the administration of the four tests, 77 of the 103 students were tested by the existing conventional test (The Matriculation Exam) [...]. [...] the comparison between the experimental tests and the [...] was done.

(b) Evaluation by experts

A group of sixteen language testing experts, attending a convention on research [...], [...] detailed description of the four oral proficiency tests as well as some research findings [...]. [...] exposed to the Standards for Educational Testing Methods presented in this paper. Following the discussion of the four testing methods, the experts were asked to rank each method according to its Accuracy, Utility, Feasibility and Fairness. The ranking was done individually, using a four-point scale [...] for each standard. [...] definition of each standard as a reminder to the experts [...] each standard regarding each testing method.

(c) Analysis by policy makers

[...] of Education [...] of the Matriculation Examination, several discussions were held at the Ministry regarding this issue. Senior administrators associated with EFL instruction and testing, and the developers of the four alternative testing methods, participated in those discussions. As a result of the discussions a decision was made to further experiment with an integrated [...] of the four testing methods with a sample of 1000 12th grade students to substitute for the existing Matriculation oral test. We use these discussions as a case study to demonstrate some interesting points [...] to the process in which policy makers use evaluative information to assess the merit of alternative testing methods.

RESULTS

We shall report our findings for each group of standards regarding the four testing methods on the basis of the relevant [...] obtained from the three sources of our study.

(a) Utility standards

The Utility standards are intended to ensure that a testing method will serve practical information needs of given audiences [...] to have a positive impact on the teaching and learning [...] as well as on the decision making process of those associated with the testing and its results.

As can be seen in Table 1, the group of [...] testing experts ranked the Group Discussion (GD) test as being the one with the highest utility value among the [...] testing methods. This high rank was justified by some of the experts [...] the positive backwash effect that the GD test might have on instruction, stimulating teachers to allocate [...] for discussion in their classes. The group of policy makers also considered the backwash effect of the Group Discussion test as an important feature of the test and decided to support its possible use in the future in spite of some logistic difficulties associated with its [...] relatively low accuracy qualities.

Insert Table 1 about here

It is interesting to note that while the GD has been ranked highest on Utility, it has been ranked lowest on Accuracy. At the same time the experts ranked the Oral Interview (OI) test quite low (3) on Utility, in spite of the fact that it was ranked highest on all other three standards.
(b) Accuracy standards

The Accuracy standards are intended to ensure that a testing method will reveal [...] and otherwise technically adequate information on educational achievements. The experimental try-out as well as the evaluation by the language testing experts provided information on the accuracy of the four testing methods included in our study.

One of the concerns of the Ministry of Education regarding the existing Matriculation oral proficiency test was related to its relatively high scores and their low dispersion. Some of the opponents of those tests argued that "since almost every student gets a high score on this test anyhow, why waste so much time and effort on it?" And indeed for the students who participated in the experimental try-out and also took the existing Matriculation proficiency test, a mean score of 7.79 and a standard deviation of 1.03 were obtained on this test (see Table 2).



Insert Table 2 about here

Compared to the existing test, lower mean scores and higher standard deviations were obtained for all four experimental tests. Among them the Group Discussion test seemed to have the lowest mean score (6.06) with the highest standard deviation [...]. The lowest standard deviation (S.D. = 1.32) was obtained for the Reporting test. Considering the relationship between variability and reliability, the data on the standard deviations of the four tests could be some kind of reflection of the reliability of the tests, mainly the reliability associated with errors of measurement that apply to test content itself rather than the biases of the scorers.

More direct information on the tests' reliability can be obtained from the findings for the inter-rater reliability. As can be seen in Table 2, the highest inter-rater reliability was obtained for the Oral Interview (r = .91). The inter-rater reliability for the Reporting test was r = .81, and for the Role Play r = .76.*

The ranking of the tests by the level of their inter-rater reliability seems to be in general agreement with the overall ranking for accuracy provided by the panel of experts (see Table 1). They also ranked the Oral Interview highest on Accuracy; second highest they ranked the Reporting test; third, the Role Play; and the Group Discussion test was ranked lowest on this standard. If we consider the findings on inter-rater reliability as a valid indication of one aspect of test accuracy, it is apparent that there is agreement between the data obtained from the experimental field try-out and the judgments provided by a panel of experts.

*The inter-rater reliability for the Group Discussion test has not been computed yet at the present time.
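As an illustration of how an inter-rater coefficient of this kind can be obtained, the sketch below computes a Pearson correlation between two sets of ratings for the same students. This is an assumption for illustration only: the paper does not say which coefficient was used, and the scores below are invented values on the 4-to-10 scale, not data from the study.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one rater gives constant scores")
    return cov / sqrt(var_x * var_y)


# Hypothetical ratings for eight students on the 4-10 scale:
# the on-the-spot rating and a second rating made from the tape.
on_the_spot = [6, 7, 5, 8, 9, 6, 7, 4]
from_tape = [6, 8, 5, 7, 9, 6, 6, 5]

r = pearson_r(on_the_spot, from_tape)
print(f"inter-rater r = {r:.2f}")
```

A coefficient near 1 means the two raters order the students almost identically; note that a correlation of this kind does not detect one rater being consistently more lenient than the other.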

(c) Feasibility standards

Administering the four tests within the framework of the experimental try-out suggested that all of them can be implemented as feasible testing methods to test oral proficiency without any major difficulties. The testers, who were in most cases regular EFL high school teachers, went through a relatively short and simple training process, and succeeded in completing each test in ten minutes per student. Regarding the feasibility of implementing these tests, there seemed to be no apparent advantage for any single test, except for the Group Discussion test, which did create some difficulties in reaching uniform procedures among testers and overcoming some logistic problems in coordinating group testing sessions for students who took all other tests on an individual basis.

The students participating in the experimental try-out seemed to be enjoying the testing experience, although in their questionnaires they showed some preference for the OI and the RPL tests.

The panel of experts (see Table 1) ranked the Oral Interview highest on feasibility and the Role Play test as lowest. The second lowest on feasibility they ranked the Group Discussion test, as was also indicated by the experience gained from the experimental try-out.

The policy makers expressed concern regarding the feasibility of introducing the newly developed tests into the system in terms of cost, [...] of testers. They were especially concerned about the logistics of administering the Group Discussion test in conjunction with the other tests administered on an individual basis.

(d) Fairness standards

Two major sources of information were available in this study regarding the fairness of the four testing methods: the student questionnaire from the experimental try-out and the rankings provided by the panel of language testing experts (Table 1). At the end of each test students filled out a questionnaire in which they were asked to agree or disagree with a set of statements expressing their attitude toward the test. One of those statements was "This test reflected my true knowledge in speaking English." Students' responses to this statement are presented in Table 3.

Insert Table 3 about here

If we consider this statement as a possible expression of test fairness, we can see in Table 3 that the Oral Interview was perceived by students as the fairest opportunity to express their knowledge in speaking English. Almost 85 percent of the students agreed (or strongly agreed) with the statement and its mean rating was 3.00. The second-best for fairness came out the Role Play test, for which more than 70 percent of the students agreed that it reflected their true knowledge in speaking English. Students' opinions seemed to be balanced on the Group Discussion test but were somewhat negative regarding the Reporting test. More than 60 percent of them did not think that this test reflected their true knowledge in speaking English.

Using the mean level of students' agreement with the statement for each test, we could rank the four tests for Fairness from high to low as follows: Oral Interview, Role Play, Group Discussion, and Reporting. If we compare Table 3 with Table 1, we will find that students' perceptions of the tests' level of fairness differ from those of the testing experts, except for the Oral Interview. Both groups ranked this test highest on Fairness but strongly disagreed on the Role Play test. This test was considered as second best by students but was ranked lowest by the experts (see Table 1). Unfortunately, testing experts do not consult students whenever they are asked to rate the Fairness of a test.
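The mean ratings and the student ranking discussed above can be checked against the response percentages in Table 3. The sketch below assumes that responses were scored 4 (strongly agree) down to 1 (strongly disagree); the paper does not state the scoring, but this choice reproduces the reported means. The two "strongly disagree" cells that are illegible in the source are entered as None and filled as 100 minus the other three percentages, which is an inference rather than a reported figure.

```python
# Percent of students per response category, from Table 3:
# (strongly agree, agree, disagree, strongly disagree).
# None marks a cell that is illegible in the source.
RESPONSES = {
    "OI": (17.5, 66.0, 14.6, None),
    "RPL": (9.8, 62.7, 26.5, 1.0),
    "REP": (3.9, 35.0, 52.4, 8.7),
    "GD": (5.8, 45.6, 41.7, None),
}
WEIGHTS = (4, 3, 2, 1)  # assumed scoring: strongly agree = 4 ... strongly disagree = 1


def fill_missing(row):
    """Fill a single missing percentage so that the row sums to 100."""
    missing = [v for v in row if v is None]
    if not missing:
        return row
    if len(missing) > 1:
        raise ValueError("cannot infer more than one missing cell per row")
    remainder = round(100.0 - sum(v for v in row if v is not None), 1)
    return tuple(remainder if v is None else v for v in row)


def mean_rating(row):
    """Weighted mean rating from a percentage distribution."""
    return sum(w * p for w, p in zip(WEIGHTS, row)) / 100.0


means = {test: round(mean_rating(fill_missing(row)), 2) for test, row in RESPONSES.items()}
ranking = sorted(means, key=means.get, reverse=True)

for test in ranking:
    print(test, means[test])
```

Under these assumptions the Role Play, Reporting and Group Discussion means should match the table (2.81, 2.34, 2.50), and the Oral Interview comes out within rounding of the reported 3.00, since the percentages themselves are rounded. The resulting order, OI, RPL, GD, REP, agrees with the rank column of Table 3.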


SUMMARY AND DISCUSSION

Our study demonstrated that the Joint Committee's Standards could be adopted for testing methods and used as a framework to analyze and assess the merit of alternative testing methods. Being conducted in the context of a real decision making process, this study showed that such a framework provides [...] of information relevant to decision makers. Decision makers were interested in the information regarding the Utility and Feasibility of the various tests, and did not limit their interest to Accuracy, when they considered the introduction of the newly developed tests into the [...].

The rankings of [...] distinction between the qualities of the four tests according to the various standards. One example [...], which was ranked highest on all standards except Utility. Another example was the GD test, which was ranked highest on Utility but lowest on Accuracy. These findings suggest that testing experts should not limit themselves to the technical [...] and use the wide scope of all four standards to judge the merit of a test.

Comparing the results obtained from the try-out study and the [...] provided by the panel of experts suggests that testing experts seem to be better in judging a test by one standard than another. Their assessments of the accuracy of the four oral proficiency tests were in strong agreement with the findings obtained from the try-out regarding the inter-rater reliability of these tests. At the same time there was a lack of agreement between the experts' ranking on Fairness and students' perceptions of test fairness as expressed in their questionnaires.

Although the study provided some interesting observations regarding the applicability of the [...] the assessment of testing methods, it was based mainly on secondary sources of information and provided only a partial attempt to study the whole scope of the Standards. More systematic efforts in this direction should be encouraged in the future.
Table 1

Ranking of Tests by Experts, according to the Four Standards

                                         Standard
Test                      Utility   Accuracy   Feasibility   Fairness

Oral Interview (OI)          3          1           1            1
Role Play (RPL)            [...]        3           4            4
Reporting (REP)            [...]        2         [...]        [...]
Group Discussion (GD)        1          4           3            2

1 = High
4 = Low

Table 2

Mean Scores, Standard Deviations and Inter-rater Reliability of Oral Proficiency Tests

Test                           Mean*    S.D.    Inter-rater Reliability**

Oral Interview (OI)            6.49     1.39            .91
Role Play (RPL)                [...]    1.81            .76
Reporting (REP)                6.57     1.32            .81
Group Discussion (GD)          6.00     1.93       (not computed)

Existing Matriculation Test    7.79     1.03

 * n = 103
** n = 25
Table 3

Distribution of Students' Responses to Statement "This test reflected my true knowledge in speaking English" by Test (in percentage)

Test                      Strongly   Agree   Disagree   Strongly    Mean   Overall
                          agree                         disagree           rank

Oral Interview (OI)        17.5      66.0     14.6       [...]      3.00      1
Role Play (RPL)             9.8      62.7     26.5        1.0       2.81      2
Reporting (REP)             3.9      35.0     52.4        8.7       2.34      4
Group Discussion (GD)       5.8      45.6     41.7       [...]      2.50      3